Using a Hugging Face Model with Llama.CPP 
Docs 


https://github.com/ggerganov/llama.cpp 


PYTHON WRAPPER (my holy grail for docs now): 


https://llama-cpp-python.readthedocs.io/en/latest/ 


Listen, these LLM folk are more MLM, jk. But you do need a GGUF file.


This video helped a lot.
It helped me piece together that I really need to quantize the model or get a GGUF file.
https://www.youtube.com/watch?v=jOEU0PE40zM


Pull this stuff, 


step 1.1: 


LOGIN AND STUFF 
https://huggingface.co/docs/huggingface_hub/en/guides/cli#huggingface-cli-login 


download the model 
https://huggingface.co/docs/huggingface_hub/en/guides/cli#huggingface-cli-download 


I downloaded the whole thing (the reference video at the top was helpful; also, a lot of
sources tell you the download usually ends up in the ~/.cache directory)
https://huggingface.co/meta-llama/Meta-Llama-3-8B 
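
A minimal sketch of the login-then-download step, assuming `huggingface-cli` is installed and you have already run `huggingface-cli login` (the Meta-Llama-3-8B repo is gated, so the login matters). The helper names and the `models/Meta-Llama-3-8B` target folder are my own placeholders, not something from the docs.

```python
import subprocess


def build_download_command(repo_id: str, local_dir: str) -> list[str]:
    """Return the huggingface-cli call that downloads a whole model repo."""
    return ["huggingface-cli", "download", repo_id, "--local-dir", local_dir]


def download(repo_id: str, local_dir: str) -> None:
    """Run the download; raises CalledProcessError if the CLI exits non-zero."""
    subprocess.run(build_download_command(repo_id, local_dir), check=True)


if __name__ == "__main__":
    # Only print the command here; call download(...) to actually fetch the files.
    cmd = build_download_command("meta-llama/Meta-Llama-3-8B", "models/Meta-Llama-3-8B")
    print(" ".join(cmd))
```

Passing `--local-dir` puts the files in a folder you choose instead of the cache, which makes it easier to point the convert script at them later.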


Once you have this download you can move on to llama.cpp


video helps guide one to getting to this file if you need it 


step 2 https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#build


Here I made a new dir and cloned the repo
mkdir llamaCPP
git clone https://github.com/ggerganov/llama.cpp


cd into repo, 
https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#cuda 




After that I was able to run the stuff here (step 3):
# obtain the official LLaMA model weights and place them in ./models
ls ./models
llama-2-7b tokenizer_checklist.chk tokenizer.model
# [Optional] for models using BPE tokenizers
ls ./models
<folder containing weights and tokenizer json> vocab.json
# [Optional] for PyTorch .bin models like Mistral-7B
ls ./models
<folder containing weights and tokenizer json>


# install Python dependencies 


python3 -m pip install -r requirements.txt 


# convert the model to ggml FP16 format 
python3 convert-hf-to-gguf.py models/mymodel/ 


# quantize the model to 4-bits (using Q4_K_M method)
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M


# update the gguf filetype to current version if older version is now unsupported
./llama-quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY
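
The convert-then-quantize steps above can be sketched as a small helper that builds both commands for a chosen quant type. This assumes you run it from inside the llama.cpp repo after building, that the script is still named `convert-hf-to-gguf.py` in your checkout, and that it accepts `--outfile` and `--outtype`. The function name and file naming scheme are my own.

```python
from pathlib import PurePosixPath


def build_gguf_commands(model_dir: str, quant_type: str = "Q4_K_M") -> list[list[str]]:
    """Return [convert_command, quantize_command] for one model folder."""
    model = PurePosixPath(model_dir)
    f16_path = model / "ggml-model-f16.gguf"
    quant_path = model / f"ggml-model-{quant_type}.gguf"

    # Step 1: Hugging Face weights -> FP16 GGUF file.
    convert = [
        "python3", "convert-hf-to-gguf.py", str(model),
        "--outfile", str(f16_path), "--outtype", "f16",
    ]
    # Step 2: FP16 GGUF -> quantized GGUF.
    quantize = ["./llama-quantize", str(f16_path), str(quant_path), quant_type]
    return [convert, quantize]


if __name__ == "__main__":
    for command in build_gguf_commands("models/mymodel", "Q6_K"):
        print(" ".join(command))
```

Each command can be handed to `subprocess.run(command, check=True)` once the printed paths look right.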


Step 3 https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#prepare-and-quantize


Quantize the model, 


DYOR 


all random stuff i've looked up 


Not bad resources 
https://medium.com/@ingridwickstevens/quantization-of-llms-with-llama-cpp-9bbf59deda35 


Like I didn't do this, there's a lot of research, don't overwhelm yourself


https://github.com/ggerganov/llama.cpp/issues/1344 


my specs, 

https://www.google.com/search?q=can+i+run+llama+3+8b+with+a+3080+site:www.reddit.com


read about q6_k 

https://www.google.com/search?q=quantize+the+model+to+Q6_K
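
A rough back-of-the-envelope sketch for picking a quant type: file size is roughly parameters times bits per weight divided by 8. Everything numeric here is an assumption on my part: about 8.03 billion parameters for Llama 3 8B, approximate bits-per-weight figures (the Q4_K_M value especially, since it mixes block types), a 10 GB card, and a made-up 1.5 GB allowance for context and CUDA overhead.

```python
# Approximate bits per weight; treat these as ballpark figures, not exact values.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.5625,
    "Q4_K_M": 4.85,  # assumed average for a mixed quant
}


def estimate_size_gb(n_params: float, quant_type: str) -> float:
    """Estimate the GGUF size in GB (1e9 bytes) for a parameter count."""
    return n_params * BITS_PER_WEIGHT[quant_type] / 8 / 1e9


def fits_in_vram(n_params: float, quant_type: str, vram_gb: float,
                 overhead_gb: float = 1.5) -> bool:
    """True if the weights plus a hypothetical overhead allowance fit in VRAM."""
    return estimate_size_gb(n_params, quant_type) + overhead_gb <= vram_gb


if __name__ == "__main__":
    n_params = 8.03e9  # assumed parameter count for Llama 3 8B
    for quant in BITS_PER_WEIGHT:
        size = estimate_size_gb(n_params, quant)
        print(f"{quant}: ~{size:.2f} GB, fits in 10 GB: {fits_in_vram(n_params, quant, 10.0)}")
```

If a quant does not fit, the usual fallback is to offload only some layers to the GPU rather than all of them.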


reddits 


1. https://www.reddit.com/r/LocalLLaMA/comments/183ie9t/ctransformers_vs_llamacpppython_which_one_should/
2. https://www.reddit.com/r/LocalLLaMA/comments/1c8u0n5/thanks_zuckmeta_for_these_great_llama_3_models_3/
3. https://www.reddit.com/r/LocalLLaMA/comments/1aeftyu/upgrading_gtx_1060_6gb_to_rtx_3070_ti_8gb_is_good/
_upgrade/
4. https://www.reddit.com/r/LocalLLaMA/comments/1c9v4u3/llama_3_8b_any_way_to_run_this_on_rtx_3080/
5. USE Q6_K WHENEVER? https://www.reddit.com/r/LocalLLaMA/comments/1c8mvmc/llama_3_discussion_about_quants_performance/
6. https://www.reddit.com/r/LocalLLaMA/comments/1cci5w6/quantizing_llama_3_8b_seems_more_harmful_compared/
7. https://www.reddit.com/r/LocalLLaMA/comments/144uc0l/damn_i_was_so_satisfied_with_my_3080_with_10gb_of/
8. https://www.reddit.com/r/SillyTavernAI/comments/1cu4dd6/can_anyone_recommend_some_local_models_for_3080/
9. https://www.reddit.com/r/LocalLLaMA/comments/1aezi29/difference_between_the_different_python_libraries/
10. https://www.reddit.com/r/LocalLLaMA/comments/1cetn9z/quantization_seems_to_hurt_the_quality_of_llama_3/
11. https://www.reddit.com/r/LocalLLaMA/comments/1c9qufe/note_on_llama_3_quantized_models/


https://huggingface.co/docs/huggingface_hub/en/guides/cli


IT'S RUNNING ON MY CPU
TO CUDA WE GO


pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124 --upgrade --force-reinstall --no-cache-dir


useful commands 


watch ur nvidia card 


watch -n 1 nvidia-smi 


takes a one-time snapshot:


nvidia-smi 


check cuda version 


nvcc --version 
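
A quick sketch for checking from Python whether those two tools are even on your PATH before going further. It only uses the standard library; the function name is my own.

```python
import shutil


def find_cuda_tools(names=("nvidia-smi", "nvcc")) -> dict:
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_cuda_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If `nvidia-smi` is found but `nvcc` is not, the driver is probably there but the CUDA toolkit is missing or its `bin` folder is not on PATH.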


commands 1 by 1 
https://developer.nvidia.com/cuda-12-4-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=deb_local




https://www.reddit.com/r/KoboldAI/comments/160p2jv/can_someone_eli5_how_to_calculate_the_number_of/


https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#wsl 
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#post-installation-actions 


CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python 


CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" \ 
pip install llama-cpp-python 


FINAL COMMAND 


pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124 --upgrade --force-reinstall --no-cache-dir


https://github.com/ggerganov/llama.cpp/issues/6898 


checking for cuda existence 


mako77@DESKTOP-UMFPB6H:~$ ls /usr/local/cuda
DOCS README compute-sanitizer extras include libnvvp nvml share targets version.json
EULA.txt bin doc gds lib64 nsightee_plugins nvvm src tools
mako77@DESKTOP-UMFPB6H:~$ ls /usr/local/
bin cuda cuda-12 cuda-12.4 cuda-12.5 etc games include lib man sbin share src
mako77@DESKTOP-UMFPB6H:~$ export CUDA_HOME=/usr/local/cuda-12.4
mako77@DESKTOP-UMFPB6H:~$ which nvcc
/usr/local/cuda-12.4/bin/nvcc
mako77@DESKTOP-UMFPB6H:~$ echo $CUDA_HOME
/usr/local/cuda-12.4
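
The same environment setup can be sketched in Python, which is handy if you launch the build from a script. This builds a copy of the environment with `CUDA_HOME` set and the CUDA `bin` folder put first on `PATH`. The function name is mine, and `/usr/local/cuda-12.4` is just the path from the session above.

```python
import os


def cuda_env(cuda_home: str, base_env=None) -> dict:
    """Return a copy of the environment with CUDA_HOME set and its bin dir first on PATH."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_HOME"] = cuda_home
    cuda_bin = os.path.join(cuda_home, "bin")
    old_path = env.get("PATH", "")
    # Prepend so this toolkit's nvcc wins over any other CUDA version installed.
    env["PATH"] = cuda_bin + (os.pathsep + old_path if old_path else "")
    return env


if __name__ == "__main__":
    env = cuda_env("/usr/local/cuda-12.4")
    print(env["CUDA_HOME"])
    print(env["PATH"].split(os.pathsep)[0])
```

Pass the result as `env=` to `subprocess.run(...)` so the build sees these variables without changing your shell.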


SETTING UP LLAMA CPP 


HANDLING THE CPU BEING 100% 


YOU NEED TO PAY ATTENTION TO ENVIRONMENT
VARIABLES, LEARN TO PROPERLY USE THEM, SET THEM,
FIND THEM, ETC.


I do believe this is one of the things that saved me


CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir


# Linux and Mac 
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" \ 
pip install llama-cpp-python 


Resources for CPU BEING 100% 


(1) Rebuilding: at first I did not build with CUDA_HOME or the paths set.


glaudiston commented on Jul 28, 2023 (edited)


To build the libllama.so with gpu support you need to have CUDA SDK installed, then: 


git clone https://github.com/ggerganov/llama.cpp 
cd llama.cpp 

export CUDA_HOME=/your/cuda/home/path/here 
export PATH=${CUDA_HOME}/bin:$PATH 

export LLAMA_CUBLAS=on 

make clean 

make libllama.so 


Then note that the g++ compiler will add the -DGGML_USE_CUBLAS compiler flag. 
and it will create a file called libllama.so in the current directory. 
check it with 


ls -1 libllama.so 


After that you can force llama-cpp-python to use that lib with: 


export LLAMA_CPP_LIB=/path/to/your/libllama.so 


After that, it worked with GPU support here. Of course you have to init your model with something like 


llm = Llama(
    # ... other arguments ...
    n_gpu_layers=20,
)


Hope it helps. 
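
A minimal sketch of initializing the model with GPU offload through llama-cpp-python, assuming the package was built with CUDA support. The helper names and the `n_ctx` default are my own choices. `n_gpu_layers=-1` asks for all layers to be offloaded, and the `llama_cpp` import is kept inside `load_model` so the rest can be read and tested without the package installed.

```python
def build_llama_kwargs(model_path: str, n_gpu_layers: int = -1, n_ctx: int = 2048) -> dict:
    """Collect the constructor arguments for llama_cpp.Llama in one place."""
    return {
        "model_path": model_path,      # path to the quantized .gguf file
        "n_gpu_layers": n_gpu_layers,  # -1 = offload every layer, 0 = CPU only
        "n_ctx": n_ctx,                # context window size (assumed default)
        "verbose": True,               # keep the load log visible
    }


def load_model(model_path: str, n_gpu_layers: int = -1):
    """Create the Llama object; needs llama-cpp-python installed."""
    from llama_cpp import Llama  # imported lazily on purpose

    return Llama(**build_llama_kwargs(model_path, n_gpu_layers))


if __name__ == "__main__":
    print(build_llama_kwargs("models/mymodel/ggml-model-Q6_K.gguf", n_gpu_layers=20))
```

With `verbose=True` the load log should mention how many layers were offloaded; if the CPU still sits at 100% and nothing is offloaded, the package was most likely built without CUDA.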
After doing all that,


ALSO MAKING SURE I INSTALLED CUDA, BECAUSE I DID NOT KNOW I WOULD NEED
CUDA LOL


I re-quantized the model after this, meaning I deleted the previous one and redid it with these
conditions fulfilled.


After re-quantizing 


# Linux and Mac 
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" \ 
pip install llama-cpp-python 


I guess it did say to do this prior,


The official docs of this wrapper suggest this: 


Upgrading and Reinstalling 


To upgrade and rebuild llama-cpp-python add --upgrade --force-reinstall --no-cache-dir
flags to the pip install command to ensure the package is rebuilt from source.


Which was almost right,
but ultimately what worked for me was plainly making sure I pip uninstalled the package


then ran 


CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
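
The uninstall-then-rebuild sequence can be sketched as a helper that prepares both pip commands plus the environment carrying `CMAKE_ARGS`. The CMake flag is a parameter because these notes show more than one spelling (`-DLLAMA_CUBLAS=on` and `-DLLAMA_CUDA=on`), so use whichever matches your llama-cpp-python version. The function name is my own.

```python
import os
import sys


def build_reinstall_plan(cmake_flag: str = "-DLLAMA_CUBLAS=on", base_env=None):
    """Return (uninstall_cmd, install_cmd, env) for a clean CUDA rebuild."""
    env = dict(os.environ if base_env is None else base_env)
    env["CMAKE_ARGS"] = cmake_flag  # read by the package's build step

    pip = [sys.executable, "-m", "pip"]
    uninstall = pip + ["uninstall", "-y", "llama-cpp-python"]
    install = pip + [
        "install", "llama-cpp-python",
        "--force-reinstall", "--upgrade", "--no-cache-dir",
    ]
    return uninstall, install, env


if __name__ == "__main__":
    uninstall, install, env = build_reinstall_plan()
    print(" ".join(uninstall))
    print(f'CMAKE_ARGS="{env["CMAKE_ARGS"]}" ' + " ".join(install))
```

Run the two commands in order with `subprocess.run(cmd, env=env, check=True)`; using `sys.executable -m pip` keeps the install in the same Python environment you will import from.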


Links that helped --- 


found solution here 


This helped hugely; it gave me the idea to rebuild everything. When in doubt, just rebuild


https://stackoverflow.com/questions/78415856/detecting-gpu-availability-in-llama-cpp-python 
Definitely need to set env variables first, I imagine


I did do this also when running the build


https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#cuda 
really feel like it didn't do much, but who knows; I did it, so it's included


this is where i found the command 


https://github.com/ggerganov/llama.cpp/issues/6360 


